Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus

Authors

  • Khalid Choukri
  • Mahtab Nikkhou
  • Niklas Paulsson
Abstract

Broadcast news is a very rich source of Language Resources that has been exploited to develop and assess a large set of Human Language Technologies. Examples include systems that automatically produce text transcriptions of spoken data; identify the language of a text; translate a text from one language to another; identify topics in the news and retrieve all stories discussing a target topic; retrieve stories directly from the broadcast audio; and extract summaries of the content of news stories. BNSC is a broadcast news speech corpus developed in the framework of the European-funded project Network of Data Centres (NetDC). The corpus contains more than 20 hours of Arabic news recordings in Modern Standard Arabic. The news was recorded over a period of 3 months and transcribed in Arabic script. The project was carried out in cooperation with the LDC (Linguistic Data Consortium), which has produced a similar corpus from Voice of America Arabic broadcasts in the United States. This paper presents the BNSC corpus production from data collection to final product.

Similar resources

The need to create a media block for the convergence of overseas news networks

As a general diplomacy arm of the Islamic Republic of Iran, VoSiMa has extensive activities in international broadcasting of its radio and television programs. These programs are broadcast in different languages, such as English, French, Azeri, Arabic, and ... for regional and transnational audiences. The large volume of the organization's international activities is in the form of news and new...

MATBN 2002: a Mandarin Chinese Broadcast News Corpus

The MATBN 2002 Mandarin Chinese broadcast news corpus contains a total of 40 hours of broadcast news from Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary motivation for this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast domain. We expect to collect and process 220 hours of Mandarin Chine...

Arabic broadcast news transcription system

This paper describes the development of an Arabic broadcast news transcription system. The presented system is a speaker-independent, large-vocabulary natural Arabic speech recognition system, and it is intended to be a test bed for further research into the open-ended problem of achieving natural language man-machine conversation. The system addresses a number of challenging issues pertaining t...

Testing a large corpus of natural standard Arabic for rhythm class

Previous studies using acoustic correlates to measure speech rhythm have used small samples of audio and a limited number of speakers. Few have included standard Arabic in the analysis. This study uses Arabic news broadcasts, along with time-aligned transcripts output by an automatic speech recognizer, to test over 50 minutes of speech by 46 speakers. The results show that Arabic, like Englis...

MATBN: A Mandarin Chinese Broadcast News Corpus

The MATBN Mandarin Chinese broadcast news corpus contains a total of 198 hours of broadcast news from the Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary purpose of this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast news domain. In this paper, we briefly introduce the speech corpus and r...

Publication date: 2004